Team Information

Project Title - Home Credit Default Risk (HCDR)

The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

Group Number - 21

Names -

Kumar Saurabh (ksaurabh@iu.edu)

Shubham Thakur (sbmthakur@iu.edu)

Ameya Dalvi (abdalvi@iu.edu)

Vishwa Shrirame (vshriram@iu.edu)

Team Photos

WhatsApp Image 2021-11-16 at 9.49.21 PM.jpeg

Kaggle API setup

Kaggle is a data science competition platform that also hosts many datasets. In the past, submitting results was cumbersome: you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line. E.g.,

! kaggle competitions files home-credit-default-risk

It is quite easy to set up; a submission takes less than 15 minutes end to end.

  1. Install library

For more detailed information on setting the Kaggle API see here and here.
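A typical command-line workflow looks like the following. This is a sketch: it assumes the `kaggle` package is installed and an API token is stored in `~/.kaggle/kaggle.json`; the output paths are placeholders.

```shell
# Download and unpack the competition data
kaggle competitions download -c home-credit-default-risk -p ../data
unzip -o ../data/home-credit-default-risk.zip -d ../data

# Later, submit a predictions file directly from the command line
kaggle competitions submit -c home-credit-default-risk -f submission.csv -m "phase 3 submission"
```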

Dataset and how to download

Background on the Home Credit Group

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Background on the dataset

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise either be unable to obtain loans or fall victim to untrustworthy lenders.

The Home Credit Group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).


Data files overview

There are 7 different sources of data:

image.png

Downloading the files via Kaggle API

Create a base directory:

DATA_DIR = "../../../Data/home-credit-default-risk"   #same level as course repo in the data directory

Please download the project data files and data dictionary and unzip them using either of the following approaches:

  1. Click on the Download button on the following Data Webpage and unzip the zip file to the BASE_DIR
  2. If you plan to use the Kaggle API, please use the following steps.

Imports

Data files overview

Data Dictionary

As part of the data download comes a data dictionary. It is named HomeCredit_columns_description.csv.

image.png

Application train

Application test

The application dataset has the most information about the client: Gender, income, family status, education ...

The Other datasets

Exploratory Data Analysis

Summary of Application train

Dividing the train dataset into train and validation sets

Summary Statistics -

Determine the categorical and numerical features

Missing data Analysis

Missing data for application train

Bar plot of missing values in each column

We notice there are a lot of missing values in the dataset

Bar plot of the percentage of missing values in each column

Imputing Missing data

We will construct the numerical pipeline and categorical pipeline

Correlation Analysis

Correlation with the target column

Pair plot of the 4 top correlated features

We can see the plot of the top 4 correlated attributes with the Target column. EXT_SOURCE_1 seems to be normally distributed while others are skewed but can be approximated to normal distribution.
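A correlation ranking of this kind can be sketched with pandas — the frame below is toy stand-in data, not the real dataset:

```python
import pandas as pd

# Hypothetical numeric slice of application_train
df = pd.DataFrame({
    "TARGET": [0, 0, 1, 1, 0, 1],
    "EXT_SOURCE_1": [0.9, 0.8, 0.2, 0.1, 0.7, 0.3],
    "AMT_CREDIT": [200, 250, 300, 280, 210, 320],
})

# Correlation of every numeric column with TARGET, ranked by magnitude
corr_with_target = (df.corr()["TARGET"]
                      .drop("TARGET")
                      .sort_values(key=abs, ascending=False))
top4 = corr_with_target.head(4)   # candidates for the pair plot
```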

Heatmap of correlated attributes

Bar plot of correlated attributes

Visual EDA

Distribution of the target column

Evaluating categorical features with respect to TARGET

TARGET - 0: LOAN WAS REPAID 1: LOAN WAS NOT REPAID

Categorical distribution

Numerical distribution

We notice that there are many outliers in the data as seen in the box plot. We can visualize the median and the quantiles of each column data by these box plots.

Histogram plot shows the distribution of data over a range. We have visualized each numerical data column's data distribution.
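For example, the usual box-plot whisker criterion (the 1.5 x IQR rule) flags such outliers — the income values below are made up for illustration:

```python
import numpy as np

# Hypothetical income column with one extreme value
income = np.array([25_000, 30_000, 32_000, 28_000, 27_000, 1_000_000])

q1, q3 = np.percentile(income, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # box-plot whiskers
outliers = income[(income < lower) | (income > upper)]
```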

Applicants Age

Here we can conclude that people aged 30-50 submit more loan applications.

Applicants occupations

Laborers apply for more loans than people in other occupation types.

Dataset questions

Unique record for each SK_ID_CURR

Input Features -

Modeling

Baseline Logistic Regression -

The objective function for learning a binomial logistic regression model (log loss) can be stated as follows:

$$ \underset{\mathbf{\theta}}{\operatorname{argmin}}\left[\text{CXE}\right] = \underset{\mathbf{\theta}}{\operatorname{argmin}} \left[ -\dfrac{1}{m} \sum\limits_{i=1}^{m}{\left[ y^{(i)} log\left(\hat{p}^{(i)}\right) + (1 - y^{(i)}) log\left(1 - \hat{p}^{(i)}\right)\right]} \right] $$

The corresponding gradient function of partial derivatives is as follows (after a little bit of math):

$$ \begin{aligned} \nabla_\text{CXE}(\mathbf{\theta}) &= \begin{pmatrix} \frac{\partial}{\partial \theta_0} \text{CXE}(\mathbf{\theta}) \\ \frac{\partial}{\partial \theta_1} \text{CXE}(\mathbf{\theta}) \\ \vdots \\ \frac{\partial}{\partial \theta_n} \text{CXE}(\mathbf{\theta}) \end{pmatrix}\\ &= \dfrac{1}{m} \mathbf{X}^T \cdot (\hat{\mathbf{p}} - \mathbf{y}) \end{aligned} $$

For completeness learning a binomial logistic regression model via gradient descent would use the following step iteratively:

$$ \mathbf{\theta}^{(\text{next step})} = \mathbf{\theta} - \eta \nabla_\text{CXE}(\mathbf{\theta}) $$
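The loss, gradient, and update rule above can be sketched in NumPy. The toy data, `eta`, and the step count are illustrative, not tuned values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, eta=0.1, n_steps=2000):
    """Batch gradient descent on the cross-entropy loss.
    X includes a leading column of ones for the bias term theta_0."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_steps):
        p_hat = sigmoid(X @ theta)
        grad = (1.0 / m) * X.T @ (p_hat - y)   # gradient of CXE w.r.t. theta
        theta -= eta * grad                    # theta <- theta - eta * grad
    return theta

# Hypothetical linearly separable toy data (bias column prepended)
X = np.array([[1, 0.0], [1, 1.0], [1, 2.0], [1, 3.0]])
y = np.array([0, 0, 1, 1])
theta = fit_logreg(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(int)
```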

Baseline Decision Tree Classifier -

Cost functions used for classification and regression.

In both cases the cost functions try to find most homogeneous branches, or branches having groups with similar responses.

Regression : $\sum_i \left(y^{(i)} - \hat{y}^{(i)}\right)^2$

Classification : $G = \sum_k p_k \left(1 - p_k\right)$

A Gini score gives an idea of how good a split is by how mixed the response classes are in the groups created by the split. Here, $p_k$ is the proportion of class-$k$ samples in a particular group.
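A minimal implementation of the Gini score for a single group of responses:

```python
def gini_impurity(labels):
    """Gini impurity G = sum_k p_k * (1 - p_k) over a node's labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    total = 0.0
    for c in set(labels):
        p = labels.count(c) / n   # proportion of class c in the group
        total += p * (1 - p)
    return total

# A perfectly mixed node scores 0.5; a pure node scores 0.0
mixed = gini_impurity([0, 0, 1, 1])
pure = gini_impurity([1, 1, 1, 1])
```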

DecisionTreeClassifier

The criterion is the function used to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain. Information gain uses entropy as the impurity measure and splits a node so as to maximize the information gained, whereas Gini impurity measures the divergence between the probability distributions of the target attribute's values and splits a node so as to minimize impurity.

Gini : $\Large 1 - \sum^m_{j=1} P_j^2$

Entropy : $\Large -\sum^m_{j=1} P_j \log\left(P_j\right)$

Feature importance formula

To calculate the importance of each feature, we consider each decision node itself as well as its child nodes. The following formula covers the calculation of node importance.

For each decision tree, scikit-learn calculates a node's importance using Gini importance, assuming only two child nodes (binary tree):

$\Large ni_j = w_jC_j - w_{left(j)}C_{left(j)} - w_{right(j)}C_{right(j)}$

Where

ni_j= the importance of node j

w_j = weighted number of samples reaching node j

C_j= the impurity value of node j

left(j) = child node from left split on node j

right(j) = child node from right split on node j

The importance for each feature on a decision tree is then calculated as:

$\Large fi_i = \frac{\sum_{j:\,\text{node } j \text{ splits on feature } i} ni_j}{\sum_{k \,\in\, \text{all nodes}} ni_k}$

$fi_i$ is feature importance for $i^{th}$ feature

These can then be normalized to a value between 0 and 1 by dividing by the sum of all feature importance values:

$\Large normfi_i = \frac{fi_i}{\sum_{j \,\in\, \text{all features}} fi_j}$
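In practice, scikit-learn exposes the normalized importances directly via `feature_importances_`. A minimal sketch on synthetic data (the dataset here is a hypothetical stand-in, not the HCDR data):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in; scikit-learn applies the node-importance formula above
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

importances = tree.feature_importances_   # already normalized to sum to 1
```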

Reference: https://towardsdatascience.com/the-mathematics-of-decision-trees-random-forest-and-feature-importance-in-scikit-learn-and-spark-f2861df67e3

Decision Trees Parameters for classification

Baseline Random Forest Classifier -

Random Forest Parameters

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

Feature Importance

Yet another great quality of Random Forests is that they make it easy to measure the relative importance of each feature. Scikit-Learn measures a feature’s importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). More precisely, it is a weighted average, where each node’s weight is equal to the number of training samples that are associated with it.

Feature Importances of Random Forest Classifier

Feature Engineering

Bureau table

Our Model Pipeline

Feature 1 : Percentage of Active Loans

Feature 2: Credit income percentage feature

Feature 3: Credit Annuity Ratio of the Application

That is, how many years it would take the borrower to repay the credit amount requested in the application.

Feature 4: Income Annuity Ratio of the Application

Feature 5: Average of "Days past due" per customer

Credit card balance data

Feature 6: Debt Overdue per Customer

We believe that this sudden spike in accuracy is due to overfitting, so we keep the previous test accuracy, i.e., 92.13%, as our result from feature engineering.

Random Forest Classifier (Ensemble method)

Feature Importances on ensemble method

As per the feature importances plot, we conclude that the Credit Annuity Ratio of the Application (Feature 3) is the most important feature.

Hyperparameter Tuning

Hyperparameter tuning for logistic regression

Best parameters for Logistic Regression

After performing hyperparameter tuning for logistic regression, we obtain an improved test accuracy of 92.46%.

Hyperparameter tuning for random forest classifier

Best parameters for Random Forest Classifier

After performing hyperparameter tuning for the random forest classifier, we obtain an improved test accuracy of 92.975%.
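A hyperparameter search of this kind can be sketched with scikit-learn's `GridSearchCV`. The grid and synthetic data below are illustrative, not our exact search space:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed training data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical grid over the regularization strength C
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid, cv=3, scoring="roc_auc")
search.fit(X, y)

best_C = search.best_params_["C"]   # winning setting under cross-validation
```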

Neural Network Implementation (Phase 3)

Multi-layer perceptron

Single-layer perceptron

Training for Multi-layer perceptron

Training for Single-layer perceptron

Submission File Prep

For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:

SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
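Preparing such a file with pandas can be sketched as follows — the IDs and probabilities are the example values above, written to an in-memory buffer for illustration:

```python
import io

import pandas as pd

# Hypothetical predicted probabilities for three test applicants
submission = pd.DataFrame({
    "SK_ID_CURR": [100001, 100005, 100013],
    "TARGET": [0.1, 0.9, 0.2],
})

# In the real workflow this would be submission.to_csv("submission.csv", index=False)
buf = io.StringIO()
submission.to_csv(buf, index=False)
csv_text = buf.getvalue()
```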

Preprocessing for Feature Engineering ML Model

Preprocessing for Neural Network model

Kaggle submission via the command line API

Our Submission

image-2.png

Report submission

Click on this link

Abstract

The main goal of this project is to build an optimal ML model that predicts whether a loan applicant will be able to repay his/her loan. In Phase 2, we extended our work from Phase 1 and implemented feature engineering: we considered potential features from other tables, performed feature selection on the derived features, analyzed feature importances, and carried out hyperparameter tuning. In feature engineering, we derived 6 additional features and achieved an improvement over our baseline model; however, some feature models led to overfitting, so in future work we aim to select the most important features on balanced data to avoid overfitting. The feature most relevant to our goal was the Credit Annuity Ratio of the current application, which we identified by computing feature importances on our model. After performing hyperparameter tuning on logistic regression (our best pipeline), we achieved a test accuracy of 92.46%. In Phase 3, we extended our work from Phase 2 with a deep learning model: a multi-layer perceptron (MLP), a kind of artificial neural network. The main goals of this phase were to implement the MLP and visualize the training run on TensorBoard. We also identified sources of data leakage in our project and their respective causes. Another goal of this phase was to improve our results from Phase 2, and we were successful in doing that.

Introduction

Data Description:

The main table is divided into two files: Train (with TARGET) and Test (without TARGET).

All past credit issued to the client by other financial institutions and reported to the Credit Bureau.

Monthly balances of previous credits in Credit Bureau.

Monthly balance snapshots of the applicant's prior POS (point of sale) and cash loans with Home Credit.

Monthly balance snapshots of the applicant's prior credit cards with Home Credit.

All prior Home Credit loan applications of clients with loans in our sample.

Payment history in Home Credit for previously disbursed credits related to the loans in our sample.

The columns in the various data files are described in this file.

The tasks to be tackled are:

In Phase 2, we implemented Feature engineering to identify potential features that could help us get better results. We mainly derived 6 features:

Untitled%20Diagram%20%285%29.jpg

Pipelines

Here we created two different pipelines for numerical and categorical features respectively. We performed standardization and imputation on the numerical features, and imputation and one-hot encoding on the categorical features. We combined the two pipelines using a ColumnTransformer and passed the result on for modeling.

We are passing the combined data pipeline to Logistic Regression model in this pipeline

Here we are passing the data pipeline to the Random Forest classifier

Feature Engineering

We derived 6 features in total out of which we selected 5 for our modeling.

  1. Active loan Percentage feature
  2. Credit Income Ratio feature
  3. Credit Annuity Ratio of the Application
  4. Income Annuity Ratio of the Application
  5. Average of "Days past due" per customer
  6. Debt Overdue per Customer
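Ratio features like these can be derived with pandas along the following lines — the values are toy numbers, and the column names follow the HCDR schema:

```python
import pandas as pd

# Hypothetical slice of application_train
app = pd.DataFrame({
    "SK_ID_CURR": [100001, 100002],
    "AMT_CREDIT": [200000.0, 450000.0],
    "AMT_ANNUITY": [20000.0, 30000.0],
    "AMT_INCOME_TOTAL": [100000.0, 150000.0],
})

app["CREDIT_INCOME_RATIO"] = app["AMT_CREDIT"] / app["AMT_INCOME_TOTAL"]
app["CREDIT_ANNUITY_RATIO"] = app["AMT_CREDIT"] / app["AMT_ANNUITY"]   # ~years to repay
app["INCOME_ANNUITY_RATIO"] = app["AMT_INCOME_TOTAL"] / app["AMT_ANNUITY"]
```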

Impact of these features to the model -

  1. Active loan Percentage feature
    • We achieved an improved result using this feature over our baseline model.
  2. Credit Income Ratio feature
    • We achieved a slight improvement in the test ROC using this feature.
  3. Credit Annuity Ratio of the Application
    • This feature was identified as the most important using feature importances on the model.
  4. Income Annuity Ratio of the Application
    • We achieved a slight improvement in the test ROC using this feature.
  5. Average of "Days past due" per customer
    • We achieved a significant improvement in the test ROC using this feature.
  6. Debt Overdue per Customer
    • We observed a spike in the result using this feature. We think this is due to overfitting.

Neural Network

Multi-Layer Perceptron

Loss function used in the Neural Network -

Cross Entropy Loss

Cross Entropy Loss criterion computes the cross entropy loss between input and target.

It is useful when training a classification problem with C classes. If provided, the optional argument weight should be a 1D Tensor assigning weight to each of the classes. This is particularly useful when you have an unbalanced training set.

In our project, we do have an unbalanced training set: about 270,000 of the roughly 300,000 data points have a target of 0 (i.e., people who repaid their loan).

Equation -

Cross-entropy can be calculated using the probabilities of the events from P and Q, as follows:

$$ H(P, Q) = -\sum_{x \in X} P(x) \log_2 Q(x) $$

where $P(x)$ is the probability of event $x$ under $P$, $Q(x)$ is the probability of event $x$ under $Q$, and the logarithm is base 2.
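A minimal implementation of this formula (using the base-2 logarithm, and skipping terms with zero probability under P):

```python
from math import log2

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) * log2(Q(x)); entries of q must be positive."""
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

# A one-hot target against a uniform prediction costs exactly 1 bit
h = cross_entropy([1.0, 0.0], [0.5, 0.5])
```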

Single-Layer Perceptron Model

In our neural network we used 14 input features, 1 linear hidden layer with 20 neurons, and 2 output features.

In the forward propagation we used leaky_relu as the activation function.

Multi Layer Perceptron Model

In our neural network we used 14 input features, 2 linear hidden layers with 20 neurons each, and 2 output features.

In the forward propagation we used leaky_relu as the activation function.
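The forward pass of this architecture can be sketched framework-agnostically in NumPy. The weights below are random placeholders for illustration only; our actual model was trained with a deep learning framework:

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: identity for positive inputs, small slope otherwise."""
    return np.where(z > 0, z, alpha * z)

# Layer sizes from the text: 14 inputs -> 20 -> 20 hidden -> 2 outputs
W1, b1 = rng.normal(size=(14, 20)), np.zeros(20)
W2, b2 = rng.normal(size=(20, 20)), np.zeros(20)
W3, b3 = rng.normal(size=(20, 2)), np.zeros(2)

def forward(x):
    h1 = leaky_relu(x @ W1 + b1)
    h2 = leaky_relu(h1 @ W2 + b2)
    return h2 @ W3 + b3   # raw logits; cross-entropy loss applies softmax

logits = forward(rng.normal(size=(5, 14)))   # batch of 5 applicants
```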

We received a test accuracy of 83.53% using the Neural network model

Visualization of training model in TensorBoard

image.png

Leakage

There has been some data leakage in our implementation. We state the sources pointwise -

Modeling Pipelines

Families of Input Features -

Hyperparameters considered for Logistic Regression -

Loss function used - Cross Entropy Loss


Experimental results

Before Hyper-Parameter Tuning

Screen%20Shot%202021-12-07%20at%209.13.16%20PM.png

After Hyper-Parameter Tuning

Screen%20Shot%202021-12-07%20at%209.35.56%20PM.png

Screen%20Shot%202021-12-07%20at%209.56.40%20PM.png

Screen%20Shot%202021-12-07%20at%209.59.23%20PM.png

Improved Kaggle Score of Phase 2

image.png

Confusion Matrix of our Neural network model

Test Accuracy Score

Kaggle submission of our Neural network model

image.png

Discussion Experimental Results

In Phase 2, we mainly considered Logistic Regression and the Random Forest classifier as our candidate models, based on the previous phase's results.

Additionally, as feature engineering was the focus of this phase, we derived 6 new candidate features from the given datasets, of which we selected 5 for model training.

Including these features helped increase the test accuracy of logistic regression from 91.93% to 92.08%, as can be seen from the results.

This accuracy was before Hyperparameter Tuning. We then performed Hyperparameter Tuning to further improve our insights on the model.

Hyperparameter tuning helped improve the test accuracy for logistic regression from 92.08% to 92.46%.

Important callout: The one feature we did not select for model training was Debt Overdue per Customer. Selecting it caused a sudden jump in test accuracy from 92.08% to 97.00%. We believe this spike is due to overfitting, and we therefore keep the earlier test accuracy, i.e., 92.13%, as our feature engineering result.

Similarly, for the Random Forest classifier, we were able to achieve a test accuracy of 92.975%.

Phase 3 -

In Phase 3, we first improved our results from Phase 2, achieving a Kaggle submission score of 0.73136, up from our previous score of 0.6491.

We implemented deep learning in this phase and we employed a Multi-Layer Perceptron as our model.

We received the train accuracy of 83.91% after 1000 epochs and a test accuracy of 83.53%.

We then visualized our training model on TensorBoard and generated a plot of loss function and accuracy.

We also implemented a single-layer perceptron model and achieved slightly lower accuracy compared to the multi-layer model.

Conclusion

The main purpose of this project is to create a machine learning model that can predict whether or not a loan applicant will be able to repay the loan. Many worthy applicants with no credit history or default history are getting rejected without any statistical analysis. The ML model in our work is trained on the HCDR dataset. It predicts whether an applicant will be able to repay a loan based on the history of similar applicants in the past. This helps filter applicants with solid statistical backing derived from the various factors taken into consideration, benefiting both the worthy applicant in securing a loan and the bank in growing its business. We performed feature engineering, feature selection, and hyperparameter tuning to improve our classification model's predictions of whether a loan applicant can repay. We identified the Credit Annuity Ratio of the Application as the most important feature in our implementation; it is the ratio of the credited loan amount to the applicant's loan annuity. The results we obtained after modeling and fine-tuning give us confidence that the model can successfully predict applicants' creditworthiness. There might be some inaccurate feature selections in our work, as we got a decreased score on one Kaggle submission; we will analyze which features cause this and add or remove features to improve our score.

This is the third iteration of our model and we improved the inaccurate feature selections thus also improving the Kaggle submission. We also implemented a deep learning algorithm - Multi-layer Perceptron and generated a classified model to aid our implementation. We made use of TensorBoard to visualize our training model in real-time. We found that by improving feature selection and balancing the data, we are achieving better results. We also found that Multi-layer Perceptron model performs better than Single Layer Perceptron model.

Kaggle Submission

image.png

References

Some of the material in this notebook has been adapted from here

  1. Understanding AUC - ROC Curve
  2. Better Heatmaps and Correlation Matrix Plots in Python
  3. Bar Plots and Modern Alternatives
  4. Data Visualization using Matplotlib
  5. sklearn.ensemble.RandomForestClassifier
  6. Feature Engineering
  7. https://machinelearningmastery.com/cross-entropy-for-machine-learning/
  8. https://towardsdatascience.com/multilayer-perceptron-explained-with-a-real-life-example-and-python-code-sentiment-analysis-cb408ee93141

TODO: Predicting Loan Repayment with Automated Feature Engineering in Featuretools

Read the following: